Confirming the statistically significant superiority of tree-based machine learning algorithms over their counterparts for tabular data.
PLoS One; 19(4): e0301541, 2024.
Article in En | MEDLINE | ID: mdl-38635591
ABSTRACT
Many individual studies in the literature have observed the superiority of tree-based machine learning (ML) algorithms. However, the current body of literature lacks statistical validation of this superiority. This study addresses the gap by applying five ML algorithms to 200 open-access datasets from a wide range of research contexts to statistically confirm the superiority of tree-based ML algorithms over their counterparts. Specifically, it examines two tree-based ML algorithms (Decision tree and Random forest) and three non-tree-based ML algorithms (Support vector machine, Logistic regression and k-nearest neighbour). Results from paired-sample t-tests show that both tree-based ML algorithms perform better than each non-tree-based ML algorithm on all four ML performance measures considered in this study (accuracy, precision, recall and F1 score), each at the p<0.001 significance level. This performance superiority is consistent across both the model development and test phases. For further validation, the study also applied paired-sample t-tests to two subsets of the research datasets, from the disease prediction (66 datasets) and university-ranking (50 datasets) research contexts. The observed superiority remains valid for these subsets: tree-based ML algorithms significantly outperformed non-tree-based algorithms in both research contexts on all four performance measures. We discuss the research implications of these findings in detail in this article.
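The paired-sample t-test named in the abstract compares two algorithms on the same datasets by testing whether the mean per-dataset difference in a performance measure is zero. The following is a minimal sketch of that statistic, not the authors' code: the function name `paired_t_statistic` and the accuracy values are hypothetical and chosen only for illustration.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired-sample t-test.

    scores_a[i] and scores_b[i] must be the same performance measure
    (e.g. accuracy) for two algorithms on the same dataset i.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("differences have zero variance")
    # t = mean difference divided by its standard error
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical accuracies on five datasets (illustrative values only,
# not results from the study).
random_forest = [0.91, 0.88, 0.95, 0.79, 0.86]
logistic_regression = [0.87, 0.85, 0.90, 0.80, 0.81]

t_stat, dof = paired_t_statistic(random_forest, logistic_regression)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

A positive t means the first algorithm scored higher on average. Converting t to a p-value requires the t-distribution with the returned degrees of freedom; if SciPy is available, `scipy.stats.ttest_rel` computes both the statistic and the p-value directly.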
Collection: 01-internacional
Database: MEDLINE
Main subject: Algorithms / Machine Learning
Limits: Humans
Language: En
Journal: PLoS One
Journal subject: Science / Medicine
Year: 2024
Document type: Article